{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring the Natality Dataset \n",
    "\n",
    "**Learning Objectives**\n",
    "\n",
    "- Explore a BigQuery dataset inside Jupyter notebook\n",
    "- Read data from BigQuery into Pandas dataframe\n",
    "- Examine the distribution of average baby weight across various features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "In this notebook we'll read data from BigQuery into our notebook to begin some preliminary data exploration of the natality dataset. To beign we'll set environment variables related to our GCP project. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "BUCKET = \"cloud-training-bucket\"  # Replace with your BUCKET\n",
    "REGION = \"us-central1\"            # Choose an available region for Cloud MLE\n",
    "TFVERSION = \"1.14\"                # TF version for CMLE to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"BUCKET\"] = BUCKET\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"REGION\"] = REGION\n",
    "os.environ[\"TFVERSION\"] = TFVERSION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "if ! gcloud storage ls | grep -q gs://${BUCKET}/; then\n",
    "    gcloud storage buckets create --location=${REGION} gs://${BUCKET}\n",
    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Explore data </h2>\n",
    "\n",
    "The data is natality data (record of births in the US). My goal is to predict the baby's weight given a number of factors about the pregnancy and the baby's mother.  Later, we will want to split the data into training and eval datasets. The hash of the year-month will be used for that -- this way, twins born on the same day won't end up in different cuts of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create SQL query using natality data after the year 2000\n",
    "query_string = \"\"\"\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING))) AS hashmonth\n",
    "FROM\n",
    "    publicdata.samples.natality\n",
    "WHERE\n",
    "    year > 2000\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call BigQuery and examine in dataframe\n",
    "from google.cloud import bigquery\n",
    "bq = bigquery.Client(project = PROJECT)\n",
    "\n",
    "df = bq.query(query_string + \"LIMIT 100\").to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's write a query to find see how the number of babies and their average weight is distributed across a given field. This is important to ensure that we have enough examples of each data value, and to verify our hunch that the parameter has predictive value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 1**\n",
    "\n",
    "Consider the field `is_male`. Write a query that will return the unique values for that field (i.e. `False` or `True`) as well as the count of the number of babies for each value and the average weight for each value. \n",
    "\n",
    "For example, your query should return the following result:\n",
    "\n",
    "  is_male | num_babies  | avg_wt  \n",
    "---------|---------------|-----------\n",
    " False  | 16245054  | 7.104715  \n",
    " True  | 17026860  | 7.349797 \n",
    " \n",
    " **Hint**: Look at the usage of operations like COUNT(), AVG() and GROUP BY in SQL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"\"\"\n",
    "SELECT\n",
    "    # TODO: Your code goes here\n",
    "FROM\n",
    "    publicdata.samples.natality\n",
    "WHERE\n",
    "    year > 2000\n",
    "\"\"\"\n",
    "\n",
    "df = # TODO: Your code goes here\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 2**\n",
    "\n",
    "Now write a function that will generalize your query above. That is, write a function that will take a column name from the natality dataset and return a dataframe containing the unique values for that field as well as the count of the number babies for each value and the average weight for each value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_distinct_values(column_name):\n",
    "    sql_query = \"\"\"\n",
    "    # TODO: Your code goes here\n",
    "    \"\"\".format(column_name)\n",
    "    return # TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 3**\n",
    "\n",
    "Use the `get_distinct_values` function you created above to examine the distributions for the fields `is_male`, `mothers_age`, `plurality` and `gestation_weeks`. After creating the dataframe, plot your results using `df.plot`. Have a look at [the documentation for `df.plot` here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a line plots to see mother_age with avg_wt linear and num_babies logarithmic\n",
    "df = # TODO: Your code goes here\n",
    "\n",
    "df = df.sort_values(\"mother_age\")\n",
    "df.plot(# TODO: Your code goes here\n",
    "df.plot(# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a bar plot to see plurality(singleton, twins, etc.) with avg_wt linear and num_babies logarithmic\n",
    "df = # TODO: Your code goes here\n",
    "\n",
    "df = df.sort_values(\"plurality\")\n",
    "df.plot(# TODO: Your code goes here\n",
    "df.plot(# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use a bar plot to see gestation_weeks with avg_wt linear and num_babies logarithmic\n",
    "df = # TODO: Your code goes here\n",
    "\n",
    "df = df.sort_values(\"gestation_weeks\")\n",
    "df.plot(# TODO: Your code goes here\n",
    "df.plot(# TODO: Your code goes here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion \n",
    "\n",
    "All these factors seem to play a part in the baby's weight. Male babies are heavier on average than female babies. Teenaged and older moms tend to have lower-weight babies. Twins, triplets, etc. are lower weight than single births. Preemies weigh in lower as do babies born to single moms. In addition, it is important to check whether you have enough data (number of babies) for each input value. Otherwise, the model prediction against input values that doesn't have enough data may not be reliable.\n",
    "<p>\n",
    "In the next notebook, we will develop a machine learning model to combine all of these factors to come up with a prediction of a baby's weight."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2017-2018 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
